
[refactor](be) Support recursive parquet complex readers #64357

Open
suxiaogang223 wants to merge 32 commits into apache:refact_reader_branch from suxiaogang223:codex/parquet-complex-type-contract

Conversation

@suxiaogang223

Member

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: The new parquet reader needs one complete path for nested complex types instead of separate legacy top-level LIST/MAP handling. This PR defines the nested shape/value contract, keeps nested predicate pushdown scoped to STRUCT/nested STRUCT, adds recursive reader composition for LIST, MAP, STRUCT, and scalar leaves, removes the old complex reader fallback paths, and adds unit coverage for deep nested shapes with null and empty levels.

Release note

None

Check List (For Author)

  • Test: Unit Test
    • Remote fedora: ./run-be-ut.sh -j 8 --run --filter="ParquetColumnReaderTest.*"
    • Remote fedora: ./run-be-ut.sh -j 8 --run --filter="NewParquetReaderTest.*"
  • Behavior changed: No
  • Does this need documentation: Yes

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223 force-pushed the codex/parquet-complex-type-contract branch 2 times, most recently from 11a0f4f to b57b439, on June 10, 2026 07:29
@suxiaogang223 marked this pull request as ready for review on June 10, 2026 07:33
@Gabriel39

Contributor

/review

for (const auto& child : field.children) {
const auto child_projection_it =
std::ranges::find_if(projection.children, [&](const LocalColumnIndex& child_proj) {
return child_proj.field_id() == child.file_local_id();

Contributor

The name field_id is not a good choice.

@github-actions (bot) left a comment

Contributor

I found one blocking correctness issue in nested parquet predicate pruning. The recursive complex readers and projection reshaping look coherent overall, but the new pruning path treats FLOAT -> DOUBLE as safe and then rewrites the predicate against float file stats using the rounded literal, which can prune row groups that still satisfy the original cast expression.

Code-review checkpoint conclusions:

  • Goal/test: implements recursive parquet complex reader support and adds BE tests; the issue below breaks correctness for float-to-double nested filters.
  • Scope/focus: scope is aligned with BE parquet reader/refactor work; I avoided duplicating the existing naming thread in schema_projection.cpp.
  • Concurrency/lifecycle/config/compatibility: no new concurrency, lifecycle, static-init, config, or FE/BE compatibility issue found.
  • Parallel paths: parquet stats/page/dictionary pruning and recursive reader paths were checked; the problem is in the shared nested predicate extraction path.
  • Transactions/data writes: not applicable.
  • Performance/observability: no separate blocking performance or metric issue found.
  • Tests: PR adds targeted BE tests; I did not run tests in this review. Additional coverage should include float-to-double nested predicates that must not prune matching row groups.
  • User focus: no additional review focus was provided.

}
if (from_primitive_type == TYPE_FLOAT && to_primitive_type == TYPE_DOUBLE) {
return true;
}

Contributor

This can incorrectly prune matching row groups for FLOAT -> DOUBLE nested filters. After this branch lets extract_nested_struct_path_for_pruning() strip the cast, build_nested_comparison_predicate() converts the double literal back to FLOAT and keeps the same opcode. For example, with a float leaf value equal to float(0.1), CAST(s.a AS DOUBLE) > 0.1 is true because double(float(0.1)) > 0.1, but the generated file predicate becomes s.a > float(0.1) and can reject a row group whose min/max are exactly that value. NE has the same false-negative shape. Please avoid stats/page/dictionary pruning for float-to-double casts unless the literal and comparison opcode are adjusted with a bound-aware rule.
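
The rounding behavior described in this comment can be reproduced in isolation. The following is a minimal sketch, not the Doris code path: `original_predicate`, `rewritten_predicate`, `would_prune_gt`, and `check_float_to_double_example` are hypothetical helper names, and the min/max rule is a simplified assumption about how a `>` predicate is checked against row-group statistics.

```cpp
#include <cassert>

// Row-level semantics of the original expression: CAST(leaf AS DOUBLE) > literal.
bool original_predicate(float leaf, double literal) {
    return static_cast<double>(leaf) > literal;
}

// File-layer predicate after the cast is stripped: the literal is narrowed
// to FLOAT and the comparison opcode is kept unchanged.
bool rewritten_predicate(float leaf, double literal) {
    return leaf > static_cast<float>(literal);
}

// Simplified assumption: a row group is pruned for "x > literal" when even
// its maximum value fails the rewritten predicate.
bool would_prune_gt(float stats_max, double literal) {
    return !rewritten_predicate(stats_max, literal);
}

// Walks through the case from the review comment with a leaf equal to float(0.1).
void check_float_to_double_example() {
    const float leaf = 0.1F;      // nearest float to 0.1, slightly above double 0.1
    const double literal = 0.1;
    assert(original_predicate(leaf, literal));    // the row matches the query
    assert(!rewritten_predicate(leaf, literal));  // the rewritten form rejects it
    assert(would_prune_gt(leaf, literal));        // so the row group would be dropped
}
```

With a leaf value of `0.1F` and the literal `0.1`, the original predicate holds while the rewritten one does not, so a row group whose min and max both equal that value would be pruned even though it contains a matching row.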

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Add a single contract document for the final new parquet reader complex type implementation model. The new document defines reader boundaries, nested path semantics, shape/value separation, schema evolution, pruning safety rules, lazy materialization, and phased rollout. It also removes the older struct primitive predicate proposal document that is superseded by the complete contract.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation-only change)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Clarify that the current nested parquet predicate pushdown is a transitional mapper-side extension for STRUCT and nested STRUCT primitive leaves only. Document that LIST/MAP/repeated predicate pushdown should wait for a ColumnPredicate or nested filter target refactor, and add an inline comment near the mapper extraction logic to prevent extending the transitional path beyond struct semantics.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation and comment-only change)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Refine the complex type contract so the transitional nested predicate pushdown model is STRUCT-only and DuckDB StructFilter-like. Remove LIST/MAP predicate pushdown and repeated pruning planning from the contract, while keeping LIST/MAP reader coverage as part of complex type reading.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation-only change)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Clarify that unsupported file-layer pruning does not mean unsupported row-level filtering. Complex predicates that cannot produce pruning hints must still be read through predicate projection and evaluated by localized VExprContext, especially during lazy materialization predicate phase.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation-only change)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Refocus the complex type contract on nested complex type reading and a unified Arrow-style Dremel shape/value abstraction. Clarify that row-level filtering, file-layer pruning, and lazy materialization are consumers of the nested shape model, and add shape builder and testing expectations.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation-only change)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add Phase 0 safety coverage for the new parquet reader complex type contract. The tests lock down that nested file-layer pruning remains struct-only during the transition, while list/map row-level predicates still force predicate projection and keep filter-only children out of final output mapping.

### Release note

None

### Check List (For Author)

- Test: No need to test (per request; verification will be run at a later implementation stage)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Nullable STRUCT reading previously depended on scalar child level streams to construct parent null shape. When the projected children were all complex, such as LIST/MAP children or nested STRUCT children that only contain complex descendants, the reader could not derive the STRUCT validity while preserving the top-level complex column layout. This change adds a transitional nested shape channel so complex readers can materialize values and expose ancestor null shapes from the same Dremel stream, validates sibling shapes, and covers nullable struct shape sources from complex descendants.

### Release note

None

### Check List (For Author)

- Test: No need to test (not run per request; verification deferred)
- Behavior changed: Yes (nullable STRUCT with only complex projected children can derive parent shape)
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The parquet nested shape reader channel collected parent shape null maps in a vector of NullMap values. StructColumnReader later attempted to copy-assign one NullMap into a local variable before applying the parent shape. NullMap is backed by PODArray and its copy assignment operator is deleted, so the BE UT build failed before running NewParquetReaderTest. This change uses a const reference to the selected parent shape map instead of copying it.
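
The fix pattern described here, borrowing a non-copyable element by const reference instead of copy-assigning it, can be sketched as follows. `NullMapLike` is a hypothetical stand-in for a PODArray-backed null map and `count_nulls_at` is an illustrative helper; neither is the Doris type or function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a null map whose copy operations are deleted.
class NullMapLike {
public:
    explicit NullMapLike(std::vector<std::uint8_t> data) : _data(std::move(data)) {}
    NullMapLike(NullMapLike&&) = default;
    NullMapLike& operator=(NullMapLike&&) = default;
    NullMapLike(const NullMapLike&) = delete;
    NullMapLike& operator=(const NullMapLike&) = delete;

    const std::vector<std::uint8_t>& data() const { return _data; }

private:
    std::vector<std::uint8_t> _data;
};

// Counts null flags in the selected parent shape map without copying it.
std::size_t count_nulls_at(const std::vector<NullMapLike>& shapes, std::size_t index) {
    assert(index < shapes.size());
    // NullMapLike selected = shapes[index];     // would not compile: copy is deleted
    const NullMapLike& selected = shapes[index];  // borrow the element instead
    std::size_t nulls = 0;
    for (std::uint8_t flag : selected.data()) {
        nulls += (flag != 0) ? 1 : 0;
    }
    return nulls;
}
```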

### Release note

None

### Check List (For Author)

- Test: Remote BE UT failed at compile before this fix; rerun pending.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: New parquet complex readers handled nested LIST/MAP combinations with type-specific state machines, so combinations such as array<map<...>> and map<K,map<...>> were rejected or could not be materialized recursively. This change adds a recursive nested batch/build channel backed by Doris schema level information, lets LIST and MAP readers delegate to child readers instead of enumerating combinations, and adds unit coverage for LIST of MAP and MAP of MAP.
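
The idea of delegating to child readers instead of enumerating type combinations can be illustrated with a small recursive factory. This is a sketch under assumptions: `TypeNode`, `Reader`, `make_reader`, and `describe` are hypothetical names, and the real readers carry level and value state that is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

enum class Kind { Scalar, List, Map, Struct };

// Hypothetical schema node. List has one child, Map has key and value,
// Struct has one child per field, Scalar has none.
struct TypeNode {
    Kind kind;
    std::vector<TypeNode> children;
};

// Hypothetical reader node that only records its kind and child readers.
struct Reader {
    Kind kind;
    std::vector<std::unique_ptr<Reader>> children;
};

// One recursive rule covers every nesting: build this node, then recurse.
std::unique_ptr<Reader> make_reader(const TypeNode& type) {
    assert(type.kind != Kind::Scalar || type.children.empty());
    auto reader = std::make_unique<Reader>();
    reader->kind = type.kind;
    for (const TypeNode& child : type.children) {
        reader->children.push_back(make_reader(child));
    }
    return reader;
}

// Renders the reader tree, for example "list<map<scalar,scalar>>".
std::string describe(const Reader& reader) {
    std::string name;
    switch (reader.kind) {
    case Kind::Scalar: return "scalar";
    case Kind::List: name = "list"; break;
    case Kind::Map: name = "map"; break;
    case Kind::Struct: name = "struct"; break;
    }
    name += "<";
    for (std::size_t i = 0; i < reader.children.size(); ++i) {
        if (i > 0) name += ",";
        name += describe(*reader.children[i]);
    }
    return name + ">";
}
```

Because the factory never inspects the child kind, shapes such as LIST of MAP and MAP of MAP need no dedicated branch.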

### Release note

Support additional nested Parquet complex type combinations in the new reader.

### Check List (For Author)

- Test: Unit Test

    - Attempted ./run-be-ut.sh --run --filter=ParquetColumnReaderTest.ReadSupportedComplexTypes -j 4 locally, but macOS toolchain configuration failed before compilation with ld: library 'c++' not found. Remote BE UT will be run on fedora.

- Behavior changed: Yes. New parquet reader can construct recursive LIST/MAP complex type combinations.

- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The new Parquet map-of-map unit test data only appended four top-level rows while the test fixture requires five rows. This fixes the test builder by appending the final empty map row so the generated Arrow table is valid.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - ParquetColumnReaderTest.* will be rerun on fedora.

- Behavior changed: No

- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Recursive complex reader support changed top-level LIST and MAP reads to always use the nested batch build path. Existing scalar, struct, and list/map combinations already had overflow-aware read_internal paths, and bypassing them regressed chunk-boundary, skip, selected read, and projected complex column cases. Keep the legacy top-level path for combinations it already supports, while using the new recursive build path only for recursive-only LIST<MAP> and MAP<MAP> shapes.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh and git diff --check locally. Fedora ParquetColumnReaderTest rerun will be triggered after pushing.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Recursive LIST<MAP> materialization treated container-only map shape levels as nullable scalar value slots. This inserted an extra default value before real map scalar values and shifted nullable string payloads. Skip nested scalar levels below the scalar materialization slot, while still preserving real nullable scalar null slots.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh and git diff --check locally. Fedora ParquetColumnReaderTest rerun will be triggered after pushing.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Recursive nested MAP materialization built keys from MAP entry levels but built scalar values independently from the scalar level stream. In LIST<MAP> and nested MAP shapes, scalar value streams can contain shape-only levels before the next real map entry, shifting nullable scalar values. Track the MAP entry level indices while assembling offsets and append scalar MAP values from those same indices.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh and git diff --check locally. Fedora ParquetColumnReaderTest rerun will be triggered after pushing.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Nested MAP key and scalar value level streams can diverge around value-only shape slots such as null nested containers. Appending scalar values by the key level index can therefore consume a shape/null slot before the real entry value. While appending nested MAP scalar values, scan the value stream and match the key entry repetition level, skipping non-entry value shape slots.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh and git diff --check locally. Fedora ParquetColumnReaderTest rerun will be triggered after pushing.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Matching only MAP entry levels still allowed scalar value streams to consume value-only shape slots around null nested containers. Track all processed MAP key shape levels and consume the scalar value stream for each shape event, appending a value only when the key shape represents a real MAP entry.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh and git diff --check locally. Fedora ParquetColumnReaderTest rerun will be triggered after pushing.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Recursive nested MAP build aligned scalar value levels to key levels by scanning for the next equal repetition level. Under LIST<MAP<...>>, outer list shape slots can share a repetition level with later map entries, so the scalar value stream could shift and materialize the wrong string value. This change treats MAP key and scalar value as fields of the same repeated entry struct: their definition levels may differ, but their shape slot index and repetition level must align.

### Release note

None

### Check List (For Author)

- Test: Pending remote BE UT rerun for ParquetColumnReaderTest.ReadSupportedComplexTypes.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Nested MAP build still misaligned scalar values for LIST<MAP<...>> when the parent tried to match value slots against key slots. The scalar value stream can include its own shape slots around nullable or empty containers, so parent-side slot matching is ambiguous. This change follows the recursive nested reader contract: MAP builds entry shape and keys, then asks the value reader to build exactly the number of materialized entry values from its own def/rep stream.

### Release note

None

### Check List (For Author)

- Test: Pending remote BE UT rerun for ParquetColumnReaderTest.ReadSupportedComplexTypes.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Recursive nested MAP build let scalar value readers materialize the first N value-shaped slots directly. For LIST<MAP<...>>, shape-only slots from empty or null MAP elements can have the same nullable scalar definition level as a real null value, which shifts later scalar values. This change keeps MAP entry ownership in the key stream and consumes scalar value shape slots in lockstep with key shape slots, only appending a value when the key slot represents a real MAP entry.
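
The lockstep rule can be sketched as a single pass over paired shape slots. Everything here is an illustrative assumption rather than the Doris implementation: `LevelSlot` and `build_map_values` are hypothetical names, key and value slots are assumed to be already aligned by index, and the payload is assumed to be dense (non-null values only) to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// One shape slot: definition levels of the key and value leaves at the same index.
struct LevelSlot {
    std::int16_t key_def;
    std::int16_t value_def;
};

// Appends one value per real MAP entry. The key stream owns entry detection;
// the value stream is consumed slot by slot alongside it.
std::vector<std::optional<std::string>> build_map_values(
        const std::vector<LevelSlot>& slots, std::int16_t key_entry_def,
        std::int16_t value_max_def, const std::vector<std::string>& dense_payload) {
    std::vector<std::optional<std::string>> out;
    std::size_t payload_pos = 0;
    for (const LevelSlot& slot : slots) {
        if (slot.key_def < key_entry_def) {
            continue;  // shape-only slot (null or empty MAP): consume it, append nothing
        }
        if (slot.value_def == value_max_def) {
            assert(payload_pos < dense_payload.size());
            out.emplace_back(dense_payload[payload_pos++]);
        } else {
            out.emplace_back(std::nullopt);  // real entry whose value is null
        }
    }
    return out;
}
```

For a LIST of MAP with a nullable string value, slots for a null MAP element and an empty MAP are skipped, so a later value is never shifted by them.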

### Release note

None

### Check List (For Author)

- Test: Pending remote BE UT for ParquetColumnReaderTest.ReadSupportedComplexTypes.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Recursive nested scalar reads passed the max definition level as the value slot threshold. For nullable scalar leaves, the parquet record reader materializes null placeholders in the payload column, so value indices must advance for the nullable slot level as well as non-null values. Otherwise LIST<MAP<..., nullable scalar>> shifts later values after the first null. This change makes nested scalar batch loading use the materialized scalar slot definition level and makes column build consume the same threshold.
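
The effect of the slot threshold on value indices can be shown with a small helper. This sketch assumes that the payload column holds one slot for every level whose definition level is at least `slot_def_level`, including null placeholders; `value_indices` is a hypothetical name, not a function from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Maps each level position to its payload index, or -1 when the level has no
// payload slot (for example an empty or null ancestor container).
std::vector<std::int64_t> value_indices(const std::vector<std::int16_t>& def_levels,
                                        std::int16_t slot_def_level) {
    assert(slot_def_level >= 0);
    std::vector<std::int64_t> indices;
    indices.reserve(def_levels.size());
    std::int64_t next = 0;
    for (std::int16_t def : def_levels) {
        // Null placeholders (def below the max but at or above the slot
        // level) still advance the payload index.
        indices.push_back(def >= slot_def_level ? next++ : -1);
    }
    return indices;
}
```

With definition levels `{3, 2, 3, 1, 3}` and a maximum level of 3, a threshold of 2 yields indices `{0, 1, 2, -1, 3}`. Using the maximum level as the threshold yields `{0, -1, 1, -1, 2}`, which points the third entry at the null placeholder and shifts every later value.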

### Release note

None

### Check List (For Author)

- Test: Pending remote BE UT for ParquetColumnReaderTest.ReadSupportedComplexTypes.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Recursive parquet complex-type reads could shift scalar children after a null STRUCT element or MAP value. The reader inserted a default child slot for the null STRUCT parent, but the nested leaf value index stream did not account for Arrow RecordReader placeholders emitted for null ancestors. LIST<STRUCT<...>> and MAP<..., STRUCT<...>> could therefore materialize the next real scalar value one slot late. This change keeps recursive complex readers on the nested load/build path, tracks STRUCT parent level positions for null parent alignment, and makes nested leaf value index generation advance across null-ancestor placeholders while only exposing materialized value slots.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - On fedora: ./run-be-ut.sh -j 8 --run --filter="ParquetColumnReaderTest.ReadSupportedComplexTypes"
    - On fedora: ./run-be-ut.sh -j 8 --run --filter="ParquetColumnReaderTest.*"
    - On fedora: ./run-be-ut.sh -j 8 --run --filter="NewParquetReaderTest.*"
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The new parquet complex type reader now has a recursive shape/value build path for LIST, MAP, and STRUCT. Keeping the old per-type read_internal state machines left two implementations for the same nested type semantics and made future fixes risky. This change removes the legacy complex reader paths and old overflow/helper abstractions, routes top-level and nested complex reads through load_nested_batch/build_nested_column, and keeps STRUCT ancestor shape exposure on the recursive level stream.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Remote fedora: ./run-be-ut.sh -j 8 --run --filter="ParquetColumnReaderTest.*"
    - Remote fedora: ./run-be-ut.sh -j 8 --run --filter="NewParquetReaderTest.*"
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: The recursive parquet complex reader handles nested LIST, MAP, and STRUCT combinations through one shape/value path. Existing tests covered common shallow combinations, but did not stress deeper nesting with null and empty shapes at multiple levels. This change adds deep LIST/STRUCT/MAP/LIST and MAP/LIST/MAP fixtures, then validates normal reads, skipped reads, selected reads, and chunked reads across those shapes.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Remote fedora: ./run-be-ut.sh -j 8 --run --filter="ParquetColumnReaderTest.*"
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: New parquet reader complex projection needs to keep the physical file layout separate from table output layout when nested schema evolution is involved. Without this, nested complex columns with renamed, reordered, missing, or predicate-only struct children can build file reader types in table order and then read the wrong child ordinal from the file-local block. This change records projected file children in file schema order, sorts nested scan projections by file-local id, reconstructs table output columns from the projected file layout, and keeps nested STRUCT predicate pushdown working for predicate-only siblings while continuing to reject LIST/MAP predicate pushdown by contract.
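
Keeping the file layout separate from the table layout amounts to sorting the projection by file-local id and then recording, for each table position, which file ordinal to read. The sketch below assumes hypothetical names (`ProjectedChild`, `table_to_file_ordinal`) and that table positions are unique and dense in `[0, n)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical projected struct child: where it lives in the file schema and
// where the table schema expects it in the output.
struct ProjectedChild {
    int file_local_id;
    std::size_t table_position;
};

// Returns, for each table position, the ordinal of that child in the
// file-ordered block produced by the reader.
std::vector<std::size_t> table_to_file_ordinal(std::vector<ProjectedChild> children) {
    // The reader sees children in file schema order.
    std::sort(children.begin(), children.end(),
              [](const ProjectedChild& a, const ProjectedChild& b) {
                  return a.file_local_id < b.file_local_id;
              });
    std::vector<std::size_t> mapping(children.size());
    for (std::size_t ordinal = 0; ordinal < children.size(); ++ordinal) {
        assert(children[ordinal].table_position < mapping.size());
        mapping[children[ordinal].table_position] = ordinal;
    }
    return mapping;
}
```

For children with file ids 7, 3, and 5 at table positions 0, 1, and 2, the mapping is `{2, 0, 1}`: the first table column is read from the third slot of the file-ordered block.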

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - ./run-be-ut.sh -j 8 --run --filter="TableReaderTest.*:TableColumnMapper*:LocalColumnIndexTest.*"
    - ./run-be-ut.sh -j 8 --run --filter="ParquetColumnReaderTest.*:NewParquetReaderTest.*"
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: New parquet reader nested pruning was still represented by a raw file_child_id_path beside primitive ColumnPredicate objects. That made the STRUCT-only contract implicit and left file-layer pruning code tied to a transitional path vector. This change introduces a struct-only FileNestedPredicateTarget, makes TableColumnMapper emit it for nested STRUCT primitive leaf predicates, keeps ColumnPredicate as a primitive pruning predicate, and makes Parquet reader/statistics pruning consume the target through compatibility accessors while old direct filter construction still works during migration.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - build-support/clang-format.sh be/src/format_v2/file_reader.h be/src/format_v2/file_reader.cpp be/src/format_v2/column_mapper.cpp be/src/format_v2/parquet/parquet_reader.cpp be/src/format_v2/parquet/parquet_statistics.cpp be/test/format_v2/parquet/parquet_reader_test.cpp
    - git diff --check
    - ./run-be-ut.sh -j 8 --run --filter="TableReaderTest.*:TableColumnMapper*:LocalColumnIndexTest.*:ParquetColumnReaderTest.*:NewParquetReaderTest.*"
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: New parquet reader nested file-layer pruning only recognized comparison and IN predicates on STRUCT leaf paths. This left nested IS NULL / IS NOT NULL predicates and safe widening cast wrappers as row-level filters only, even though Parquet leaf statistics can prune those cases safely for STRUCT / nested STRUCT primitive leaves. This change adds mapper extraction for nested null predicates and conservative order-preserving cast stripping for pruning targets while keeping row-level filter evaluation unchanged.
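
A conservative null-predicate pruning rule over leaf statistics might look like the following. This is an assumption-laden sketch: `LeafStats`, `NullPredicate`, and `can_prune_row_group` are hypothetical names, `num_values` is assumed to count all leaf slots including nulls, and missing statistics always disable pruning.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-row-group statistics for one primitive leaf.
struct LeafStats {
    std::int64_t num_values;  // all leaf slots, nulls included
    std::int64_t null_count;  // slots that are null at the leaf or an ancestor
    bool has_null_count;      // false when the writer did not record null_count
};

enum class NullPredicate { IsNull, IsNotNull };

// Returns true only when no row in the row group can satisfy the predicate.
bool can_prune_row_group(const LeafStats& stats, NullPredicate predicate) {
    if (!stats.has_null_count) {
        return false;  // unknown statistics: keep the row group
    }
    assert(stats.null_count <= stats.num_values);
    switch (predicate) {
    case NullPredicate::IsNull:
        return stats.null_count == 0;  // nothing is null
    case NullPredicate::IsNotNull:
        return stats.null_count == stats.num_values;  // everything is null
    }
    return false;
}
```

The rule only produces a pruning hint; rows in surviving row groups are still evaluated by the unchanged row-level filter.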

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Added TableColumnMapperTest coverage for nested IS NULL / IS NOT NULL, safe cast pruning, and unsafe cast rejection.
    - Attempted ./run-be-ut.sh -j 8 --run --filter="TableColumnMapperTest.*" locally, but CMake failed before compiling Doris because the local macOS clang linker could not find libc++.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The temporary new parquet reader complex type contract has been used to drive the current scoped implementation. Phase 6 decoder-level lazy materialization is explicitly deferred, and nested runtime filter pushdown remains outside the current transition scope. The remaining implemented contract is now covered by code and focused tests, so the planning document is removed to avoid stale guidance.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation cleanup only)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: After rebasing the parquet complex type branch onto the updated refact_reader_branch, two TableReader tests still initialized the removed TableReadOptions::profile field. This caused BE UT compilation to fail before the targeted parquet tests could run. Remove the stale designated initializer and rely on scanner_profile, matching the current TableReadOptions contract.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - BE UT attempted on fedora with ./run-be-ut.sh -j 8 --run --filter="NewParquetReaderTest.*:ParquetColumnReaderTest.*:TableColumnMapperTest.*"; compilation failed before this fix with stale TableReadOptions::profile initializers.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The parquet column reader unit test mixed two different checks: whether a physical type can be decoded through the RecordReader path and whether a top-level repeated primitive can be created as a flat scalar reader. After repeated INT32 became RecordReader-capable, the old test still expected supports_record_reader to return false and failed. Narrow the unsupported RecordReader test to the decimal precision case that is actually rejected by the current type descriptor, and add a separate test that verifies repeated primitive columns are rejected by scalar reader creation because of their repetition level.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - ./run-be-ut.sh -j 8 --run --filter="NewParquetReaderTest.*:ParquetColumnReaderTest.*:TableColumnMapperTest.*" on fedora
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Nested parquet reads relied directly on Arrow RecordReader::values_written() and RecordReader::values() while also materializing binary values through BinaryRecordReader::GetBuilderChunks(). For BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY leaves, Arrow stores values in builder chunks instead of the fixed values buffer, and GetBuilderChunks() transfers those chunks out. This made nested binary/string leaf value counts inconsistent with the level stream and could also consume binary chunks before materialization. Normalize one already-read Arrow leaf batch into a local ArrowLeafBatch object so fixed-width and binary payloads expose a single value count/materialization contract, and call GetBuilderChunks() exactly once for binary leaves.
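
The "take the chunks exactly once, then expose one value count" contract can be sketched without Arrow. `BinarySource`, `LeafBatch`, and `normalize` below are hypothetical types that only mimic the ownership-transfer behavior described above; they are not the Arrow or Doris classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical source whose take_chunks() transfers the values out, so a
// second call returns an empty result.
class BinarySource {
public:
    explicit BinarySource(std::vector<std::string> values) : _chunks(std::move(values)) {}
    std::vector<std::string> take_chunks() { return std::exchange(_chunks, {}); }

private:
    std::vector<std::string> _chunks;
};

// Normalized batch: materialization and counting both use the same storage.
struct LeafBatch {
    std::vector<std::string> binary_values;
    std::size_t value_count() const { return binary_values.size(); }
};

// Drains the source once and checks the count against the level stream.
LeafBatch normalize(BinarySource& source, std::size_t expected_values) {
    LeafBatch batch;
    batch.binary_values = source.take_chunks();  // the only call for this batch
    assert(batch.value_count() == expected_values);
    return batch;
}
```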

### Release note

Fix nested Parquet complex type reading for binary/string leaf values.

### Check List (For Author)

- Test: Regression test
    - JAVA_HOME=/home/socrates/jdk-17.0.13 ./run-regression-test.sh --run -d external_table_p0/tvf -s test_local_tvf_with_complex_type_insertinto_doris on fedora
- Behavior changed: Yes, nested Parquet complex type reads now correctly handle binary/string leaf payload counts
- Does this need documentation: No
@suxiaogang223 force-pushed the codex/parquet-complex-type-contract branch from fa13f44 to 61c997d on June 10, 2026 16:31
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The new parquet reader now derives nested parent shape through the load/build nested column flow. The older read_with_ancestor_shape API and its helper sink code no longer have call sites, so keeping them adds unused surface area to the column reader interface. This change removes the unused ancestor shape API from the parquet column reader hierarchy and deletes the associated helper functions.

### Release note

None

### Check List (For Author)

- Test: Manual test

    - Ran git diff --check

    - Searched for removed ancestor shape API references

    - build-support/check-format.sh could not run because clang-format version 16 is not available in this environment

- Behavior changed: No

- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: The parquet scalar reader had separate free-function entry points for top-level leaf reads and nested leaf reads. This made the normalized Arrow record batch state implicit in helper internals and left nested scalar materialization coupled to the old free function API. This change exposes a lightweight ParquetLeafBatch and introduces ParquetLeafReaderAdapter so both top-level and nested scalar readers read the same normalized batch before applying their own materialization logic.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check
    - build-support/check-format.sh attempted but blocked because available clang-format is not version 16
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: The parquet leaf reader helper was still named as an Arrow adapter and kept its class state inside a separate context struct. The nested helper file also mixed scalar leaf state with nested column materialization helpers. This change renames the leaf reader module and public types around the Parquet leaf reader contract, moves the nested scalar batch into the leaf reader contract, removes the context wrapper from ParquetLeafReader, and narrows the nested column helper module to materialization utilities.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --cached --check
    - build-support/check-format.sh attempted but blocked because available clang-format is not version 16
- Behavior changed: No
- Does this need documentation: No
3 participants